ABSTRACT

The effects of the violation of the assumption of
normality coupled with the condition of multicollinearity upon the
outcome of testing the hypothesis Beta equals zero in the
two-predictor regression equation are investigated. A monte carlo
approach was utilized in which three different distributions were
sampled for two sample sizes over thirty-four population correlation
matrices. The preliminary results indicate that the violation of the
assumption of normality has no significant effect upon the outcome of
the hypothesis testing procedure. As was expected, however, the
population correlation matrices with extremely high collinearity
between the independent variables resulted in large standard errors
in the sampling distributions of the standardized regression
coefficients. Also, these same population correlation matrices
revealed a larger probability of committing a type II error. Many
researchers rely on beta weights to measure the importance of
predictor variables in a regression equation. With the presence of
multicollinearity, however, these estimates of population
standardized regression weights will be subject to extreme
fluctuation and should be interpreted with caution, especially when
the sample size involved is relatively small. (Author/HC)
Abstract

This study investigated the effects of the violation of the
assumption of normality coupled with the condition of multicollinearity
upon the outcome of testing the hypothesis β' = 0 in the two-predictor
regression equation. A monte carlo approach was utilized in
which three different distributions were sampled for two sample sizes
over thirty-four population correlation matrices. The preliminary
results indicate that the violation of the assumption of normality
has no significant effect upon the outcome of the hypothesis testing
procedure. As was expected, however, the population correlation
matrices with extremely high collinearity between the independent
variables resulted in large standard errors in the sampling
distributions of the standardized regression coefficients. Also, these
same population correlation matrices revealed a larger probability of
committing a type II error. Many researchers rely on beta weights to
measure the importance of predictor variables in a regression
equation. With the presence of multicollinearity, however, these
estimates of population standardized regression weights will be
subject to extreme fluctuation and should be interpreted with caution,
especially when the sample size involved is relatively small.
The Effect of Multicollinearity and the Violation
of the Assumption of Normality on the Testing 
of Hypotheses in Regression Analysis 

One of the goals of applied research is to define functional 
relationships among variables of interest. If such relationships 
can be found, then this knowledge can be used for prediction
purposes. For example, given a subject's scores on selected X variables,
the mathematical relationship can be utilized to predict that same 
subject's score on the associated Y variable. If the relationship 
is not a stable one, then perfect prediction is not possible. This 
is generally the situation that exists in social science research. 
The best that a prediction rule can do is to provide a *good* fit to 
the data. Nevertheless, knowledge of such a rule can greatly decrease 
the errors in prediction and can be of practical utility in behavioral 
research (Hays, 1963). 

Multiple linear regression is one mathematical approach to the 
problem of prediction. Given a set of independent variables and a 
criterion variable, least squares regression weights can be calculated 
which will maximize the squared multiple correlation between the cri- 
terion vector and the predicted criterion vector (Kerlinger, 1973). 
If the variables used in the determination of the regression weights 
are transformed into z score form, then the resulting weights are 
standardized regression coefficients and sometimes are referred to 
as beta coefficients (McNemar, 1969). In the remainder of this
paper the symbol β' will be used to refer to the population
standardized regression coefficient and the symbol b' will represent
the sample weight which estimates it.
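
The z score transformation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the raw scores and the function names are hypothetical, and the code is not the authors' program. It standardizes three variables and solves the two normal equations of least squares on the standardized scores.

```python
import math

def z_scores(values):
    """Transform raw scores to z scores (mean 0, standard deviation 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def standardized_weights(y, x1, x2):
    """Least squares weights for z_y regressed on z_1 and z_2."""
    zy, z1, z2 = z_scores(y), z_scores(x1), z_scores(x2)
    n = len(y)
    # Mean cross-products of z scores are Pearson correlations.
    r_y1 = sum(a * b for a, b in zip(zy, z1)) / n
    r_y2 = sum(a * b for a, b in zip(zy, z2)) / n
    r_12 = sum(a * b for a, b in zip(z1, z2)) / n
    # Solve the 2 x 2 normal equations [1 r_12; r_12 1] b = [r_y1; r_y2].
    det = 1.0 - r_12 ** 2
    b1 = (r_y1 - r_12 * r_y2) / det
    b2 = (r_y2 - r_12 * r_y1) / det
    return b1, b2

# Hypothetical raw scores for five subjects (illustration only).
y = [1, 3, 2, 5, 4]
x1 = [1, 2, 3, 4, 5]
x2 = [5, 3, 1, 2, 4]
b1, b2 = standardized_weights(y, x1, x2)
print(round(b1, 3), round(b2, 3))
```

Because the scores are standardized, the weights depend on the data only through the three correlations among the variables.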
These b' weights have been interpreted by some researchers to
reflect the strength and direction of the relationship between an
independent variable and the criterion. However, b' weights in most
cases are not a useful measure of the importance of a predictor
variable when the independent variables are highly intercorrelated
(Darlington, 1968). There is no requirement in multiple regression
analysis that the predictor variables used in the regression equation be
uncorrelated or orthogonal (Johnston, 1963). From a linear algebra
perspective this is reasonable since a criterion vector (dependent
variable) can fit perfectly into a common vector space spanned by
basis vectors (independent variables) which are not orthogonal. (The
criterion vector can be a linear combination of these basis elements.)
Therefore, situations may occur in regression analysis in which the
independent variables are highly intercorrelated. The presence of
such highly intercorrelated predictors is termed multicollinearity.
These predictor variables are, in fact, measuring approximately the
same thing, which makes the relative influence of each independent
variable upon the criterion virtually impossible to disentangle
(Goldberger, 1968). Also, the presence of multicollinearity increases
the standard error of b' values, which results in a statistically less
consistent estimator of β' (Goldberger, 1968).
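
The inflation of the standard error can be illustrated with the variance inflation factor 1 / (1 - r_12^2), a standard textbook quantity for the two-predictor case rather than anything computed by the authors. The sketch below, with hypothetical function names and three illustrative levels of intercorrelation, shows how quickly the factor grows as r_12 approaches one.

```python
import math

def variance_inflation(r_12):
    """Factor by which collinearity multiplies the sampling variance of a weight."""
    return 1.0 / (1.0 - r_12 ** 2)

# Three illustrative levels of intercorrelation between the two predictors.
for r_12 in (0.45, 0.70, 0.95):
    vif = variance_inflation(r_12)
    print(f"r_12 = {r_12:.2f}  variance x {vif:.2f}  "
          f"standard error x {math.sqrt(vif):.2f}")
```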



When exact multicollinearity occurs, one of the independent
variables becomes a multiple of another. In the case of two predictor
variables this would mean that the best fitting function which should
be represented by a plane (see Figure 1) can instead be represented
by a line. Again visualizing this situation from the perspective
of linear algebra, it is evident that since linear dependencies
cannot exist among basis elements which span a common vector space,
the dimensionality of the vector space would in this case be
reduced to two and the best fitting function would degenerate to one
of a line. Exact multicollinearity is rare in applied research but
multicollinearity is a rather common occurrence.
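
A small numerical check, offered as a sketch with hypothetical data, shows why the exact case is degenerate: when one predictor is a multiple of the other, r_12 equals one and the determinant of the predictor correlation matrix, 1 - r_12^2, vanishes, so the two normal equations no longer have a unique solution.

```python
import math

def pearson_r(a, b):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

x1 = [2.0, 4.0, 5.0, 7.0, 9.0]      # hypothetical predictor scores
x2 = [3.0 * v for v in x1]          # exact multiple of x1
r_12 = pearson_r(x1, x2)
determinant = 1.0 - r_12 ** 2       # determinant of [[1, r_12], [r_12, 1]]
print(r_12, determinant)
```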

Statistical tests of significance can be run to determine whether
or not a specific β' value is different from zero in the population.
In order to test hypotheses such as these, an assumption of normality
must be made in the distribution of the criterion measures (Draper &
Smith, 1966). This assumption is rarely met in psychological or
social science research. Many variables of interest to psychologists
and educators are extremely skewed in the population, making such an
assumption invalid.

One of the goals of this study was to examine the effect of the
violation of this assumption upon the probability of committing a
type II error in the testing of hypotheses based upon b' coefficients.
In order to answer this research question and the others which will
be explained in turn, a monte carlo approach was taken. Extremely
skewed distributions were included among the distributions of the
variables in the populations for the purpose of making the research
more meaningful.

Turning once again to the problem of multicollinearity, one might
consider the effect of highly correlated variables upon the outcome
of the testing of hypotheses such as H0: β' = 0 for each independent
variable involved in the regression equation. Ostle (1963) states
that the F tests used in testing these hypotheses are not all
independent since the predictor variables themselves may be correlated.
Another goal of the study was to investigate the effects of
multicollinearity upon the probability of committing a type II error in
the testing of these hypotheses.

In review, the main focus of the authors was the effect of
multicollinearity coupled with the violation of the assumption of normality
in the criterion measures upon the outcome of the testing of hypotheses
concerning population regression coefficients in the two-predictor
regression equation. Answers were sought to the following specific
research questions:

1. What effect does the violation of the assumption of
   normality have upon the probability of committing a
   type II error for alpha = .05 in the testing of the null
   hypothesis H0: β'_i = 0 (i = 1, 2) for both small and large
   sample sizes?

2. What effect does the presence of multicollinearity
   have upon the probability of committing a type II
   error in the testing of these hypotheses for small
   and large samples?

3. Does this effect (if any) change as the distribution
   sampled becomes more skewed?

The mathematical model under investigation may be written as:

Z_y = \beta'_1 Z_1 + \beta'_2 Z_2 + e

or equivalently:

Z_y = Z_x \beta' + e

where Z_y is an (n x 1) vector of observations in z score form

Z_x is an (n x 2) matrix of known form whose elements are also
standardized

β' is a (2 x 1) vector of parameters

e is an (n x 1) vector of errors

and where the e_i are independently and normally distributed (Draper &
Smith, 1966). This last statement is needed in order to test the
significance of β'. We must also make the important assumption that
the linear model defines the best functional fit to the data in the
population. This assumption can be met by sampling from a
multivariate normal distribution (Blalock, 1972) which was accomplished
through the monte carlo program.

The test of whether a specific β' value was different from zero
(the null hypothesis being β' = 0) was based upon the following test
statistic (McNemar, 1969):

F = \frac{(R_1^2 - R_2^2)/(m_1 - m_2)}{(1 - R_1^2)/(N - m_1 - 1)}

where R_1 is the multiple correlation coefficient based upon m_1 of
the predictor variables and R_2 is the multiple correlation coefficient
based upon m_2 of the remaining variables, where m_2 = m_1 - 1. Sample
b' values were calculated using the following formulae (McNemar,
1969):

b'_1 = \frac{r_{y1} - r_{y2} r_{12}}{1 - r_{12}^2}
\qquad
b'_2 = \frac{r_{y2} - r_{y1} r_{12}}{1 - r_{12}^2}
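
For the two-predictor case the test statistic can be computed from the three sample correlations alone. The following is a minimal sketch under stated assumptions: the function names and the sample correlations are hypothetical, the squared multiple correlation is obtained from the usual two-predictor identity, and the critical value is the tabled F(.05; 1, 22).

```python
def squared_multiple_r(r_y1, r_y2, r_12):
    """R^2 for the criterion regressed on both predictors."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

def f_for_weight(r_y_tested, r_y_other, r_12, n):
    """F (df = 1, n - 3) for H0: the tested predictor's weight is zero.

    Full model: m1 = 2 predictors. Reduced model: m2 = 1 predictor
    (the one not being tested), whose R^2 is its squared correlation with y.
    """
    r2_full = squared_multiple_r(r_y_tested, r_y_other, r_12)
    r2_reduced = r_y_other ** 2
    m1, m2 = 2, 1
    return ((r2_full - r2_reduced) / (m1 - m2)) / ((1 - r2_full) / (n - m1 - 1))

# Hypothetical sample correlations from a sample of n = 25.
f_b1 = f_for_weight(r_y_tested=0.50, r_y_other=0.40, r_12=0.45, n=25)
F_CRIT_1_22 = 4.30  # tabled F(.05; 1, 22)
print(round(f_b1, 2), "retain H0" if f_b1 < F_CRIT_1_22 else "reject H0")
```

In this illustration a predictor that correlates .50 with the criterion fails to reach significance at n = 25, which is the kind of retained false hypothesis the study counts as a type II error.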

The population correlation matrices, sample sizes and population 
distributions chosen will be outlined in the next section. 

Method 

In order to answer the research questions it seemed necessary
to construct approximate sampling distributions of b'_1 and b'_2 values
from the sample regression equation:

\hat{Z}_y = b'_1 Z_1 + b'_2 Z_2

The hypotheses dealt with the violation of the assumption of normality,
level of collinearity between the independent variables, sample size
and the effect of these upon the hypothesis testing of β'. Three
different distributions were chosen from which to generate random
samples of z scores: the multivariate normal, χ² with 5 degrees of
freedom and χ² with 20 degrees of freedom. Three different levels
of intercorrelation between the predictor variables were chosen:
ρ_12 = .95, .70 and .45. In addition two different sample sizes were
selected: n = 25 and n = 100.
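
The two χ² distributions differ mainly in how skewed they are. The sketch below is an illustration only (the seed, the number of draws and the function names are arbitrary assumptions); it draws χ² variates, rescales them to z score form, and reports the theoretical skewness sqrt(8 / df). It shows the marginal shape only and does not reproduce the way the monte carlo program induced correlation among the variables.

```python
import math
import random

def chi_square_z(df, rng):
    """One chi-square(df) draw rescaled to mean 0 and variance 1."""
    x = rng.gammavariate(df / 2.0, 2.0)   # chi-square(df) is gamma(df/2, scale 2)
    return (x - df) / math.sqrt(2.0 * df)

def theoretical_skewness(df):
    """Skewness of a chi-square distribution with df degrees of freedom."""
    return math.sqrt(8.0 / df)

rng = random.Random(1975)   # arbitrary seed
for df in (5, 20):
    draws = [chi_square_z(df, rng) for _ in range(20000)]
    mean = sum(draws) / len(draws)
    print(df, round(theoretical_skewness(df), 3), round(mean, 3))
```

With 5 degrees of freedom the skewness is about twice that with 20 degrees of freedom, so the three distributions form an ordered departure from normality.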
The basic element in the monte carlo procedure was the
intercorrelation between the independent variables in the population. At
one level of intercorrelation between Z_1 and Z_2, different levels of
correlation between Z_y and Z_1 were selected, as were different levels
of correlation between Z_y and Z_2. Thirty-four different triplets
of population intercorrelations among Z_y, Z_1 and Z_2 were selected and
are displayed in Table 1. Five cases involved a ρ_12 value of .95,
fourteen cases involved a value of .70 and fifteen cases involved
a value of .45. These triplets of population Pearson product-moment
correlation coefficients were transformed into factor structure
matrices which were then used as input into a monte carlo program
written by the main author and based upon a previously developed
Fortran program (Wherry, 1965). By focusing on one of the population
correlation matrices, the logic behind the monte carlo technique
can be more easily explained and comprehended.

For one set of fixed ρ_y1, ρ_y2 and ρ_12 values, a factor structure
matrix was calculated and a distribution and sample size were chosen
for generating sample r_y1, r_y2 and r_12 values. Because the authors
were interested in examining standardized regression coefficients which
are based upon z score values, these sample r coefficients were all
that was needed in order to calculate b'_1 and b'_2 coefficients for a
sample regression equation. Five hundred sample correlation matrices
were produced for each selected distribution and sample size; therefore
five hundred sample regression equations in z score form were developed
for the population regression equation. The five hundred b'_1
coefficients were then used to form an approximate sampling distribution
for b'_1. The same procedure was followed for b'_2.
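
The logic of the procedure can be sketched for the normal case. The code below is not the authors' Fortran program and does not use factor structure matrices; as a swapped-in simplification it builds the correlated scores directly from independent normal deviates. The population values, seed and function names are hypothetical, since Table 1 is not reproduced here.

```python
import math
import random

def corr(a, b):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def one_sample_weights(beta1, beta2, rho_12, n, rng):
    """Draw one normal sample of size n and return the sample (b'_1, b'_2)."""
    r2 = beta1 ** 2 + beta2 ** 2 + 2 * beta1 * beta2 * rho_12   # population R^2
    z1, z2, zy = [], [], []
    for _ in range(n):
        u1, u2, e = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        p1 = u1
        p2 = rho_12 * u1 + math.sqrt(1 - rho_12 ** 2) * u2   # correlates rho_12 with p1
        z1.append(p1)
        z2.append(p2)
        zy.append(beta1 * p1 + beta2 * p2 + math.sqrt(1 - r2) * e)
    r_y1, r_y2, r_12 = corr(zy, z1), corr(zy, z2), corr(z1, z2)
    det = 1 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / det, (r_y2 - r_y1 * r_12) / det

rng = random.Random(34)                      # arbitrary seed
BETA1, BETA2, RHO_12 = 0.40, 0.20, 0.45      # hypothetical population values
b1_values = [one_sample_weights(BETA1, BETA2, RHO_12, 25, rng)[0]
             for _ in range(500)]
mean_b1 = sum(b1_values) / len(b1_values)
sd_b1 = math.sqrt(sum((b - mean_b1) ** 2 for b in b1_values) / len(b1_values))
print(round(mean_b1, 3), round(sd_b1, 3))
```

The mean of the five hundred values estimates the bias of the sampling distribution and the standard deviation estimates its empirical standard error.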

As each sample b' value was produced, an F test was used to
determine if the regression weight was significantly different from
zero at the .05 level of significance. This information was tabulated
and used in the calculation of the empirical probability of committing
a type II error, which was estimated by taking the proportion of b'
values for which the null hypothesis was retained. All
the population β' values present in this study (see Table 1) were
different from zero. Therefore, the only kind of error which could
be examined was type II error: the probability of retaining a false
hypothesis.
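
The tabulation step amounts to counting retained hypotheses. A minimal sketch follows, with a hypothetical list of ten F values standing in for the five hundred produced in each cell of the study and the tabled critical value F(.05; 1, 22) for n = 25.

```python
def type_ii_proportion(f_values, f_critical):
    """Proportion of tests in which a false null hypothesis was retained."""
    retained = sum(1 for f in f_values if f < f_critical)
    return retained / len(f_values)

# Hypothetical F values (df = 1, 22); every population weight is nonzero,
# so each retained hypothesis is a type II error.
f_values = [5.1, 3.2, 0.8, 6.4, 4.29, 2.0, 7.7, 1.1, 4.31, 9.0]
rate = type_ii_proportion(f_values, f_critical=4.30)
print(rate)
```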

For each factor structure matrix six approximate sampling
distributions for b'_1 were developed and six approximate sampling
distributions for b'_2 were simultaneously developed. One was formed for
each combination of distribution and n size: multivariate normal, χ²
with 5 degrees of freedom and χ² with 20 degrees of freedom; n = 100
and n = 25. Since there were thirty-four factor structures
in total, two hundred and four approximate sampling distributions were
formed for each b' coefficient.

Characteristics of the sampling distributions, population ρ
values, distributional type, sample size and population β' values
were examined for the presence of relationships in accordance with
the research hypotheses. Table 2 through Table 7a contain the summary
statistics of the sampling distributions of each b'.
Results

Table 2 and Table 3 consist of calculations based upon the bias
involved in each sampling distribution. Since the model involved in
the regression procedure was fixed, the mean of each sampling
distribution of b' should equal the population β' value. In Table 2 and
Table 3, however, there is evidence of bias. The average bias,
whether mean or median, is slight: the maximum bias is .051. Since
each sampling distribution involved a finite number of b' values and
was, therefore, only approximate, it would seem logical to attribute
the presence of bias to the approximation technique. By scanning each
table across distributional shape (Dist. Type), there appears to be
little difference in the reported statistics and no consistent pattern
appears as the deviation from normality becomes more marked. A
Spearman correlation coefficient was calculated between bias and
distribution shape and was found to be nonsignificant in all cases
(see Table 8). Likewise, by scanning the columns of Table 2 and
Table 3 there appears to be little difference in the reported
statistics. A Spearman correlation coefficient was calculated between
bias and level of intercorrelation between predictors in the
population. This coefficient was also found to be nonsignificant in
all but one case (see Table 8).

Tables 4 and 5 contain statistics on the standard deviations
of the sampling distributions of the b' values. Scanning across each
table from left to right there appears to be little change in the
average of the standard errors for the b' coefficients. The Spearman
correlation coefficient calculated between empirical standard error
(S_b'1 and S_b'2) and distributional type was not found to be
significant. As was expected, however, there is a significant
correlation between the standard error of each b' sampling
distribution and the level of intercorrelation present between the
independent variables in the population (see Table 8). By examination
of Tables 4 and 5 one can see a decrease in the average standard
error of the sampling distributions of the b' values as the ρ_12 value
decreases from .95 to .45. This decrease is consistent for a sample
size of 25 and a sample size of 100 regardless of the distribution
sampled. As the ρ_12 value decreases, the spread of the standard error
values for the distributions also decreases as indicated by the
standard deviation statistics.

In Table 6 and Table 6a there appear statistics calculated on
difference values obtained by subtracting the theoretical probability
of committing a type II error from the empirical proportion of false
hypotheses which were retained. Again, there seems to be little change
among the average of the difference values as the shape of the
distributions sampled becomes more skewed. However, as ρ_12 decreases,
the average difference between empirical and theoretical probability
of committing a type II error also decreases. The maximum difference
appears when ρ_12 = .95; the maximum difference at this level is
.502. As ρ_12 decreases to .45, the maximum difference is found to be
.192. The spread of the difference values decreases as the ρ_12 value
decreases from .95 to .45. A significant correlation was found to
exist between the difference values (Diff1(25), Diff2(25)) and ρ_12
for a sample size of 25. As the sample size increased to 100, the
correlation was found to be nonsignificant. These difference values
can be attributed to the approximation of the monte carlo technique.
When ρ_12 was relatively low and the sample size was large, the
approximation technique was much more accurate.

Table 7 and Table 7a contain the proportion of times the null
hypothesis was falsely retained, an approximation of the probability
of committing a type II error. As the distribution becomes more
skewed, there is no significant change in the average proportion of
times a false hypothesis was retained regardless of sample size. The
Spearman correlation coefficient calculated between empirical
proportion of type II errors committed and distributional shape was found
to be nonsignificant regardless of sample size. The largest Spearman
coefficient found was .03.

As the ρ_12 value decreases, the probability of committing a type
II error also decreases as would be expected. This finding is
consistent for all distributions sampled for both sample sizes. The
average type II error within a level of ρ_12 is smaller for a sample
size of 100 than for one of 25.
Conclusions and Implications

The results illustrate that a departure from normality in the
distribution from which random samples are selected for inclusion in a
regression equation with two predictors does not significantly influence
the probability of committing a type II error in the testing of the
null hypothesis H0: β'_i = 0 (i = 1, 2). Because the assumption of
normality can rarely be met in the distribution of psychological and
educational variables, and if it seems plausible to generalize beyond
two independent variables, the results indicate that this violation
should not be of great concern to a researcher.

Level of intercorrelation confounded with a departure from
normality did not significantly influence the probability of committing
a type II error either.

As was expected, multicollinearity does have an effect upon the
sampling distribution of b' values. This fact is consistent with the
theory behind the effects of multicollinearity upon distributions of
standardized regression coefficients. The more highly the predictor
variables are correlated, the larger the standard error of the b' values.
This implies that a confidence interval around a b' value for the
purpose of estimating β' would have to be much wider in the case of a
regression equation with an r_12 value which is exceedingly high. The
smaller the amount of collinearity between two predictors and the larger
the sample size, the more statistically consistent the b' values are:
in other words the probability that the b' value is close to the β'
value of the population regression equation is increased.
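
The widening of the confidence interval can be made concrete with a common textbook large-sample approximation to the standard error of a standardized weight in the two-predictor case, sqrt((1 - R^2) / ((N - 3)(1 - r_12^2))). This formula is an assumption brought in for illustration and is not taken from the authors; the squared multiple correlation of .30 and the function name are likewise hypothetical.

```python
import math

def approx_se(r_12, n, r_squared):
    """Approximate standard error of a standardized weight, two predictors."""
    return math.sqrt((1 - r_squared) / ((n - 3) * (1 - r_12 ** 2)))

R_SQUARED = 0.30   # hypothetical squared multiple correlation, held fixed
for n in (25, 100):
    for r_12 in (0.45, 0.70, 0.95):
        half_width = 1.96 * approx_se(r_12, n, R_SQUARED)
        print(f"n = {n:3d}  r_12 = {r_12:.2f}  "
              f"approx. 95% half-width = {half_width:.2f}")
```

Under these assumptions the interval at r_12 = .95 is nearly three times as wide as at r_12 = .45 for the same sample size, and quadrupling the sample size roughly halves the width.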
Based upon the findings of this research report it would seem
that researchers dealing with variables selected from populations
with extremely skewed distributions do not have to be concerned with
any detrimental effects upon the probability of committing a type II
error. However, with small sample sizes and highly correlated
predictors, generalizations about the contribution of an independent
variable to any regression equation should be made with caution.
Sample b' values in situations such as these are subject to extreme
fluctuation and, although they are unbiased in the long run, most
researchers are dealing with only one regression equation and,
therefore, only one estimate of any population β' value.
Table 1

Population Intercorrelations Specified in Monte Carlo Procedure
and Accompanying Theoretical Standardized Regression Weights

ρ_y1 is the population correlation between the criterion
variable, z_y, and the predictor variable, z_1. ρ_y2 is the
population correlation between the criterion variable, z_y, and the
predictor variable, z_2. ρ_12 is the population correlation between
the independent variables, z_1 and z_2. These population correlations
were utilized in the determination of factor structure matrices for
input into the Monte Carlo technique. There are five factor
structure matrices which have a ρ_12 value of .95, fourteen which have
a ρ_12 value of .70 and fifteen which have a ρ_12 value of .45.
Table 8

Spearman Correlation Coefficients (a)

                        N = 25                      N = 100
               Dist. Type (c)  ρ_12 (d)     Dist. Type    ρ_12

Bias1 (b)         -.02           -.02          .04          -.19
                  p<.43          p<.41         p<.34        *p<.03

Bias2              .05           -.10         -.00           .04
                  p<.31          p<.16         p<.49         p<.36

Diff1 (e)          .05            .28         -.00           .06
                  p<.31         *p<.00         p<.48         p<.

Diff2              .02            .24          .00          -.00
                  p<.41         *p<.01         p<.50         p<.49

S_b'1 (f)          .05            .60          .03           .59
                  p<.32         *p<.00         p<.38        *p<.00

S_b'2              .06            .59          .02           .58
                  p<.29         *p<.00         p<.41        *p<.00

(a) Some of the correlation coefficients tabled were calculated on
variables whose elements involve statistics of sampling distributions.
These statistics were tabulated from regression equations originally
involving a sample size of 25 or a sample size of 100. The number of
cases upon which the significance was determined was 102: the number
of factor structures (34) multiplied by the number of distributions
sampled (3), which equals the number of sampling distributions examined.

(b) See notes tables 2 and 3.

(c) Dist. Type refers to the shape of the population from which the
z scores were generated for input into the regression equations for the
purpose of constructing sampling distributions. Three distributions
were involved: normal, χ² with 5 degrees of freedom and χ² with 20
degrees of freedom.

(d) ρ_12 is the population correlation between the predictor variables.
Three levels were examined: .95, .70 and .45.

(e) Diff1(25) can vary between zero and one and was calculated by
subtracting the theoretical probability of committing a type II error from
the empirical proportion of type II errors committed in the testing of
the hypothesis H0: β'_1 = 0, at α = .05. Diff2(25) was determined in the
same manner for the hypothesis H0: β'_2 = 0, as were Diff1(100) and
Diff2(100). The Diff rows in the N = 25 columns are Diff1(25) and
Diff2(25); those in the N = 100 columns are Diff1(100) and Diff2(100).

(f) S_b'1 is the empirical standard deviation of the sampling
distribution of b'_1 values.

*Significant at α = .05.
Table 9

Pearson Correlation Coefficients (a)

              Bias1 (b)  Bias2     Diff1 (e)  Diff2     S_b'1 (f)  S_b'2
                                   (25)       (25)

Bias1          1.00      -.65       .13        .20       .07        .08
                         *p<.00     p<.19     *p<.05     p<.49      p<.40

Bias2          -.65      1.00      -.30       -.29      -.29       -.28
              *p<.00               *p<.00     *p<.00    *p<.00     *p<.01

Diff1 (25)      .13      -.30      1.00        .86       .72        .70
               p<.19     *p<.00               *p<.00    *p<.00     *p<.00

Diff2 (25)      .20      -.29       .86       1.00       .66        .70
              *p<.05     *p<.00    *p<.00               *p<.00     *p<.00

S_b'1           .07      -.29       .72        .66      1.00        .99
               p<.49     *p<.00    *p<.00     *p<.00               *p<.00

S_b'2           .08      -.28       .70        .70       .99       1.00
               p<.40     *p<.01    *p<.00     *p<.00    *p<.00

N = 25

(See notes table 8)

*Significant at α = .05.
Table 10

Pearson Correlation Coefficients

              Bias1 (b)  Bias2     Diff1 (e)  Diff2     S_b'1 (f)  S_b'2
                                   (100)      (100)

Bias1          1.00      -.65      -.12       -.06      -.22       -.22
                         *p<.00     p<.21      p<.56    *p<.03     *p<.02

Bias2          -.65      1.00      -.06       -.10      -.06       -.02
              *p<.00                p<.57      p<.34     p<.57      p<.81

Diff1 (100)    -.12      -.06      1.00        .65       .57        .58
               p<.21      p<.57               *p<.00    *p<.00     *p<.00

Diff2 (100)    -.06      -.10       .65       1.00       .59        .57
               p<.56      p<.34    *p<.00               *p<.00     *p<.00

S_b'1          -.22      -.06       .57        .59      1.00        .99
              *p<.03      p<.57    *p<.00     *p<.00               *p<.00

S_b'2          -.22      -.02       .58        .57       .99       1.00
              *p<.02      p<.81    *p<.00     *p<.00    *p<.00

N = 100

(See notes table 8)

*Significant at α = .05.